Monte Carlo methods in statistical physics: 
mathematical foundations and strategies 



Michael Kastner 

National Institute for Theoretical Physics (NITheP), Stellenbosch 7600, South Africa 



Abstract 

Monte Carlo is a versatile and frequently used tool in statistical physics and beyond. Correspondingly, the 
number of algorithms and variants reported in the literature is vast, and an overview is not easy to achieve. 
In this pedagogical review, we start by presenting the probabilistic concepts which are at the basis of the 
Monte Carlo method. From these concepts the relevant free parameters — which still may be adjusted — are 
identified. Having identified these parameters, most of the tangled mass of methods and algorithms in 
statistical physics Monte Carlo can be regarded as realizations of merely a handful of basic strategies which 
are employed in order to improve convergence of a Monte Carlo computation. Once the notations introduced 
are available, many of the most widely used Monte Carlo methods and algorithms can be formulated in a 
few lines. In such a formulation, the core ideas are exposed and possible generalizations of the methods are 
less obscured by the details of a particular algorithm. 
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Preface 

This text provides a gentle and mostly self-contained introduction to Monte Carlo methods in statistical 
physics. It focuses on the mathematical and foundational aspects of Monte Carlo, showing under which 
conditions a Monte Carlo calculation can be expected to converge, and what parameters may be tuned in 
order to improve convergence. 

The presentation of the collected material differs in style and rigour from what is typically found in 
physics textbooks. It should be useful as supplementary material for an undergraduate or graduate course, 
but also scientists getting started on Monte Carlo-related research should profit from its reading. Finally, 
even the experienced Monte Carlo practitioner should find some less-known background material of interest 
in the present article, or he might just enjoy the more abstract point of view onto the subject. In particular, 
a glossary is provided which translates some of the most widely used algorithms into the formal language 
developed in the main body of the article. 

1. Introduction 

The vast majority of interacting many-particle systems in physics defies an analytic solution. Therefore 
it is no surprise that numerical methods have become increasingly popular over the last decades, up to the 
point where computational physics, besides experimental and theoretical physics, is considered as one of 
the three major pillars of physics. Monte Carlo methods are among the most widely and frequently used 
numerical techniques, with applications in statistical physics, quantum mechanics, field theory, and others. 
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Throughout the present article, the language of Monte Carlo methods in classical statistical physics is used, 
but the concepts presented are not restricted to that field. 

In its simplest version, a Monte Carlo simulation consists of only a few lines of computer code, and this 
simplicity renders it relatively easy to get started on the subject. Therefore it is no surprise that virtually 
any textbook on the subject starts by describing something like the "standard version" of a Monte Carlo 
simulation (single spin flip Metropolis algorithm) a nd then proceeds to m ore advanced alg orithms by mod- 
i fying some ingredients of this standard version [see iLandau and Binder! ((2000;) and New man and Barkema 
( 19991 ) for introductory textbooks on Monte Carlo simulations in statistical physics]. When learning about 
advanced methods, it becomes obvious that the general Monte Carlo scheme has several "free parameters" 
which may be adjusted in order to improve performance. A clear identification of these free parameters is 
often missing, and the range in which they can be varied is given only implicitly. Even among scientists 
employing Monte Carlo methods, these issues are not as well-known as one might suppose. As I found out in 
many discussions, there is still an astonishingly large number of people believing that, in order to compute 
a Monte Carlo estimator of, say, a canonical expectation value, one necessarily has to perform a canonical 
simulation with Boltzmann-type acceptance rates. This is not only incorrect, but it proved an obstacle for 
the invention of new and more efficient Monte Carlo algorithms in the past. 

A formal presentation of the foundations of Monte Carlo — and that is the main point to be made in this 
article — is useful in order to clearly identify the remaining free parameters of the underlying Markov chain. 
The choice of these parameters is at the basis of both, the power and the flexibility of the Monte Carlo 
method, and following a few basic strategies, convergence of the Monte Carlo simulation may be improved 
significantly. This knowledge is essential to fully exploit the possibilities Monte Carlo has to offer and to 
creatively develop new, efficient algorithms. 

Though only little of the material presented here is new, a good fraction of it has not been cast into 
the present form before. In other instances, certain results are well-known in the mathematical literature, 
but not so among statistical physicists (Theorem 1151 being an example). Other topics, like random number 
generation, error estimation, and the more practical aspects of Monte Carlo are not covered in this article. 

In Sec. m the few basic facts from the theory of stochastic processes which are required for Monte Carlo 
are reviewed, allowing us to introduce Markov-chain Monte Carlo in Sec. [31 A formal description of how 
to construct transition matrices of a Markov chain is then presented in Sec. H) From this formalization, 
it becomes evident which parameters may be adjusted in order to improve convergence. In Sec. [Sj the 
strategies to improve the performance of a Monte Carlo simulation are introduced. Methods and strategies 
going beyond Markov-chain Monte Carlo are briefly discussed in Sec. [HI together with a summary of the 
results presented. The paper is supplemented by a glossary, listing some of the most frequently used Monte 
Carlo methods in the language put forward in the main body of the article. 

2. Markov chains 

The characteristic feature of Monte Carlo methods is that stochasticity, i.e. randomness, is used to 
obtain an approximation of a quantity of interest. A Markov chain is the type of stochastic process which is 
typically employed and to which wc will restrict ourselves throughout this paper. Understanding the notion 
of a Markov chain will be almost all which is necessary for the definition of Monte CarloQ 

First we will define the "arena" on which a Markov chain is to be constructed. 

Definition 1 (State space). Let S be a set with a finite number of elements. Each element of S is called 
a state, and S is called the state space. 

In classical statistical physics, the state space is typically the configurational space or the phase space of 
the system. Which state space we have to choose will become obvious from what is called a "quantity of 
interest" in Definition [71 



^Note that, for brevity, the term "Monte Carlo" is used in the following, although "Markov chain Monte Carlo" would be 
more precise. 
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The restriction of finiteness is imposed on S merely for simplicity, and a generalization of the following 
to a state space with an infinite number of elements is possible. However, regarding the finiteness of the 
computer on which we intend to implement the Monte Carlo method later on, we will be confined to a finite 
state space anyway. If the state space of the system we want to investigate is infinite, for example in the 
case of continuously varying values of positions and momenta or continuously varying spin variables, we will 
be forced to do some sort of discretization or coarse-graining in order to obtain a finite state space, suitable 
for an implementation on a computer. 

Although finite, the number of states will in general be very large. It is this fact which forces us to 
revert to an approximate analysis like the Monte Carlo method instead of an exact calculation. In very 
many applications in statistical physics, the state space has a product structure, being the product of the 
state spaces of the (large number of) subsystems constituting the statistical system. This product structure 
is often exploited when constructing the proposal matrix (see Sec. |^ and several entries in the Glossary). 

In order to describe the stochasticity involved in Markov chains, a few elementary definitions from 
probability theory are needed. 

Definition 2 (Distribution). A vector X = {Xi : i €z S) is called a distribution on S if its elements Xi satisfy 
the conditions 

1. Xi>0 for all i e S, 

2. EA, = 1. 

We will use distributions to describe the probabilities of the random outcome of the Monte Carlo method. 

Definition 3 (Stochastic matrix). A matrix T — (Ty : i,j S) is called a stochastic matrix on S if all its 
row vectors are distributions, i.e. 

1. Tij > for all i,i € S, 

2. = I for all j G S. 

It is important to note (and easy to verify) that the stochasticity property of a matrix T ensures that, 
if A is a distribution, the product TA yields again a distribution. Thus, from a given distribution, we can 
generate a set of distributions by multiplication(s) with a stochastic matrix. 

For a given matrix T, there might exist one or more distributions which show the special property of 
stationarity with respect to T. 

Definition 4 (Stationary distribution). A stationary distribution of a matrix T is a distribution tt such 
that Ttt — n. 

Rephrased in terms of linear algebra, tt is a right eigenvector of T with eigenvalue 1. These stationary 
distributions will play an important role for the question whether, and at which speed, a Monte Carlo 
estimator converges towards the quantity of interest (addressed in Sec. [3]). 

Definition 5 (Random variable). A random variable X on a finite state space S is a variable which takes 
on values from the state space S , depending on the outcome of a random experiment. If the probability to 
obtain a certain value i G S as an outcome of the experiment is Xi, the vector X = {Xi : i ^ S) is said to be 
the distribution of X. 

For our purposes this will do as a working definition of a random variable. For a rigorous definition of 
a r andom variable as a mapping from a probability space to a measure space, see for example the textbook 
Baueil {1995). As a simple example, the reader might consider a random variable which describes the 



see 



outcome of the toss of a coin, where the state space S — {head, tail} consists of two possible outcomes of 
the experiment with probabilities Ahead — Ataii = 1/2. 

Now we will gather the preceding definitions to define a Markov chain as a certain sequence of random 
variables. 
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Definition 6 (Markov chain). Let T be a stochastic matrix and Xn a random variable on S. The sequence 
{^n}n=Q of random variables Xn is called a Markov chain of length N on S with transition matrix 1 and 
initial distribution X if 

1. Xq has distribution A, 

2. for n > I, conditional on X„ = i, the random variable Xn+i has distribution (T^ : j G S). 
Such a chain is called Markov{X,T, N). 

It is item two of this definition which distinguishes Markov chains from other stochastic processes: the 
fact that the distribution of the random variable Xn+i depends exclusively on its immediate predecessor X„, 
but not on earlier ones like Xn-i, Xn-2, ■■■ "A Markov chain has no memory" is an expression frequently used 
to put this property into words. The Markov property is convenient for the implementation of a stochastic 
process on a computer: One is not obliged to keep track of the history of the system, as only the current 
state of the system has to be considered in order to specify the next. 

The transition matrix T contains all the probabilities necessary to obtain Xn+i from its predecessor Xn. 
If we have Xn = j, the probability to find Xn+i = i is Tij. As a simple consequence it is found that, 
if Xq = j, the probabi lity to find X n = i is given by {T")y, the ij-th element of the n-th power of the 
transition matrix T(see&i Il998l for a proof). We can therefore write a Markov(A, T, N) chain explicitly 
as the sequence 

{Xn}nJ = {A,TA,T2A, . . . ,T~-lA} . (1) 

With the notion of a Markov chain at hand, we are now in the position to define the Monte Carlo method. 



3. Monte Carlo 



In equilibrium statistical physics, one is often interested in quantities which involve a summation or 
integration over the state space of a system. Typical examples include partition functions or ensemble 
averages of quantities like energy or magnetization. Even for systems consisting of a modest number of 
particles, the state space can be very large, too large in fact to perform such a summation or integration 
even on a computer. 

The fundamental concept of the Monte Carlo method is to replace the summation over the state space 
S" by a summation over a Markov chain on S. Thus, instead of summing over all elements from S, only 
a sample of randomly chosen states is considered. This will of course lead to an estimator of the desired 
quantity instead of an exact result. 

Definition 7 (Monte Carlo estimator). Let F = f{i) be a quantity of interest, where f is a function on 
S. We call 

the Monte Carlo estimator of F, where {Xn}n=o ^'^ Markov{X,T, N) on S, and vr = (tt; : i G S) with tt^ > 
is the stationary distribution of the transition matrix T. 

As an example, consider as a quantity of interest the canonical expectation value of the energy, 

where H is the Hamiltonian, (3 is the inverse temperature 1/T (with Boltzmann's constant set to unity), 
and 

Z(/?)=^exp[-/3i7(z)] (4) 
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is the canonical partition function. Then we define f{i) — H(i) exp[—(3H{i)]/Z{f3). If, by chance or by 
habit, we happen to choose a Markov chain with 



TT = = exp[-(3H{i)]/Z{(3) -.leS) (5) 

as its stationary distribution, we find that the Monte Carlo estimator of the canonical expectation value of 
the energy, 

is simply given by the average of H(i) over the Markov chain. 

Note that, although it might appear "natural," it is by no means necessary to employ a Markov chain 
with a Boltzmann-type stationary distribution ([5]). If, for whatsoever reason, we had preferred a Markov 
chain with, say, a Gaussian stationary distribution 

TT = (tt; oc exj>{-a[H{i) - b]^} : i e S) (7) 

with parameters a and b, we still could have computed a Monte Carlo estimator of the canonical expectation 
value of the energy by means of the formula 

~ E^e{x„}"J„l'f^(«)exp[-/3i^(^)]/7r, 
E»G{x„};:'roi exp[-/3i^(^)]/7^i 

Although the Boltzmann distribution is part of the given physical problem, it need not be part of the 
problem's solution. In fact, it will depend on the physical system of interest whether it is a good idea or not 
to use the Boltzmann distribution also as the stationary distribution of the Markov chain. We will come 
back to this issue in Sec. [51 

The rest of this paper (like virtually all scientific work published on Monte Carlo) is dedicated to the 
question of how to optimize the quality of this estimator. As indicated by the subscripts of F in Definition 
[ll the Monte Carlo estimator depends crucially on the parameters of the Markov chain employed. These 
parameters are therefore at our disposal in order to tune the Monte Carlo method: 

1. The initial distribution A, 

2. the transition matrix T, and 

3. the length N of the Markov chain (number of elements). 

It is in particular the transition matrix which is modified in order to increase the quality of the Monte 
Carlo estimator. However, this is done under the constraint of computational efficiency, meaning that a too 
complicated choice of T would, with the same amount of CPU time on a computer, lead to a significantly 
lower length N of the Markov chain which can be achieved. The task of constructing a transition matrix 
appropriate for the system under investigation and the quantity of interest is by no means trivial, and Sec. 
[5] will be devoted to some aspects of this issue. 

Up to now, we do not even know whether, and under which conditions, we have a fair chance of obtaining 
a useful estimator of the quantity of interest at all. In the remainder of this section, the notion of convergence 
is defined for a Monte Carlo estimator, and a condition on the transition matrix is derived which ensures 
convergence. Convergence, albeit not a guaranty, constitutes something like a minimal requirement for a 
Monte Carlo simulation to give reasonable results. 

Definition 8 (Convergence of a Monte Carlo estimator). A Monte Carlo estimator Fx.t,^ of F is said to 
converge if, for a given transition matrix T, 

lim Fx.T,N = F (9) 



almost surely for all initial distributions X. 
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When convergence is guaranteed, we know that, at least in the hniit of infinite length of the Markov 
chain, the method provides a Monte Carlo estimator arbitrarily close to the exact result of the quantity of 
interest. 

In Definition [8l the remaining free parameter is the transition matrix T. To specify the conditions on 
the transition matrix which ensure convergence, we need the notions of aperiodicity and irreducibility. 

Definition 9 (Aperiodic matrix). A matrix T = (Ty : i,j e S) is called aperiodic if, for any i £ S, there 
exists a natural number ni G N such that (T")^ > for all n ^ n^. 

Definition 10 (Irreducible matrix). A matrix T — (Ty : i,j £ S) is called irreducible if, for any i,j G S, 

there exists a natural number Uij G N such that (T"'^ )^-,- > 0. 

As already mentioned, the 7i-th power of the transition matrix T yields the distribution of the n-th 
element of the Markov chain, conditional on the initial distribution. Hence, in case the transition matrix 
is aperiodic and irreducible, for all sufficiently large n we have a non-zero probability to find any state of 
the state space, independently of the initial distribution. As a consequence, an aperiodic and irreducible 
transition matrix prevents us from "getting stuck" in any partition of the state space while "leaving out" 
another part. This is an essential prerequisite of the following theorem. 

Tiieorem 11 (Ergodic theorem). Let T be an aperiodic and irreducible stochastic matrix on a finite state 
space S. Then there exists a unique stationary distribution {ni : i £ S) of T. Further, let {Xn}^^Q be 
Markov{X,T,N). Then, for a bounded function f : S ^R, 

^^^^ E /» = E-^/» (10) 

almost surely for any initial distribution A. 

Proof: See for example HorHi (Il998l) . Ch. 1.10. 



The ergodic theorem allows us to read off immediately the conditions on a Monte Carlo estimator 
guaranteeing convergence: Employing a Markov chain with an aperiodic and irreducible transition matrix, 
a Monte Carlo estimator converges in the limit of infinite chain length. 

In an implementation of the Monte Carlo method on a computer, however, the length of the Markov 
chain is related to the running time on a computer, and is therefore finite. Hence, for a given system, we 
are left with the question which choices of transition matrices yield good estimators within a certain finite 
time. Before addressing this problem in Sec. [HI we will describe a method how to construct a transition 
matrix with a given stationary distribution. 



4. Construction of transition matrices 



We have seen that a stationary distribution is a characteristic feature of a transition matrix. In ap- 
plications, one typically has in mind a certain distribution tt, like the canonical one in Eq. ([5]), and the 
goal is to construct a transition matrix having that tt is its stationary distribution. As pioneered by 
Metropolis et all (|l953f ). this is conveniently done by constructing the transition probabilities from a pro- 



posal matrix P = (P^j : i,j G S) and an acceptance matrix A = (A^ : i,j G S). Starting from the element 
Xn — j oi a Markov chain, a proposal is made according to the proposal probabilities Pij for i to be the 
next element of the Markov chain. The proposal is accepted, setting Xn+i — i, with probability A^ . It 
is rejected, setting Xn+i = j, with probability (1 — A^). Formalizing this procedure, the elements of the 
transition matrix T can be written as 

= Py Ay + 5,J Pfej (1 - Afej) (11) 

fees 



where 5 denotes Kronecker's delta symbol {5ij — l\i i — j, Sij — else). 
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Why is this commonly done in apphcations of the Monte Carlo method? First, such a procedure is 
straightforward to implement on a computer. Second, as will be discussed below, sufficient conditions on 
P and A can be specified which guarantee a certain distribution to be the stationary distribution of the 
resulting transition matrix. To describe these conditions, the notion of detailed balance is introduced!! 

Definition 12 (Detailed balance). Let A = (Ay : i,j G S) be a matrix and tt = (tt,; : i G S) be a vector. A 
is said to satisfy detailed balance with respect to n if 

kijTTj = kjiTTi (12) 

for all i, j G S. 

Now we are ready to give a recipe of how to construct a transition matrix with a given stationary 
distribution. 

Theorem 13 (Transition matrix with given stationary distribution). Let P = (P^ ■ i,j ^ S) and A — (Ay : 

i,j G S) be matrices and n = (iTi : i £ S) be a distribution subject to the following conditions: 

CI. P is symmetric, i. e., Pij — Pji for all i,j G S, 

C2. P is stochastic (see Definition\B\j, 

C3. < Ay < 1 for all i,j G S, 

C4. A satisfies detailed balance with respect to tt. 

Then the transition matrix T as defined in Eq. ill]) is a stochastic matrix with stationary distribution tt. 
Proof: se e Appe ndix El 

Hastingj (1970) gives a similar recipe for the construction of transition matrices from non-symmetric 



proposal matrices. In or der to construct an acce ptance or transition matrix which satisfies detailed balance, 
a simple recipe given bv ICaracciolo et aL ( 1994h can be employed. 



Theorem 14 (Matrix which satisfies detailed balance). Let f : (0, oo) [0,1] be a function satisfying 
f{z) = zf{\/z) for all z. Then the matrix A — [kij : i.j G S) with elements A^ = flTTi/iTj) satisfies 
detailed balance with respect to n — (tt.; : i G S*). 



Proof: 



A.-. = / ( ^ U.- | = / I ) - A,,,r.. (13) 



Note that, when constructing the acceptance matrix A by this recipe, only the ratios T^i/i^j of the elements of 
the stationary distribution tt enter. This is an important feature since this way the normalization constant 
of TT is not required (and computing this normalization would again involve a summation over the entire 
state space) . It is for this reason that in the following (and in particular in the glossary) prominent examples 
of distributions used for Monte Carlo computations are defined as proportionalities. 

The most common choice for the acceptance function in Theorem[T4]is f : z min{l, z}, associated with 
the names of Metropolis et al. ( 1953f l. An alternative is the Glauber acceptance function f : z i-^ z/{l + z). 



So what have we gained so far? We have described an algorithm which tells us how to construct a 
transition matrix with a given stationary distribution. Although from this algorithm only a subset of all 
transition matrices can be constructed, there are still plenty of choices to be made within this subset, 
namely: 

PI. An initial distribution, 

P2. a symmetric, stochastic proposal matrix, 

P3. a stationary distribution, and 



^An algorithm for the construction of a transition matrix with a given stationary distribution, but without the condition of 
detailed balance for the acceptance matrix has been described bv .Boghosian. (,1999iV 
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P4. an acceptance function. 



The vast majority of Monte Carlo calculations applied in statistical mechanics use transition matri- 
ces which fit into this scheme of "proposal and acceptance," and prominent choices of proposal matrices, 
stationary distributions, and acceptance functions are summarized in the Glossary at the end of this paper. 

Being free to choose a distribution tt on a state space S and then construct a transition matrix T such 
that TT is the stationary distribution of T, we have to cope with the fact that the number of elements in the 
state space is typically very large. The only manageable procedure is to define tt by means of a — preferably 
simple — functional dependence of its elements. 



= {'^i = gi"!-) ■ i S), where g : S ^ [0, 1]. 



(14) 



In statistical physics there exist one or more functions on phase space or configurational space which play 
a fundamental role for the system under investigation. The most prominent example is the Hamiltonian 
function, but one might also think of the magnetization of a spin system, a sub-lattice magnetization, and 
many more. It appears to be "natural," mainly due to the concept of importance sampling described in 
Sec. 15.11 to employ these function(s) also for the construction of the transition matrix. This is typically 
done by choosing a stationary distribution expressed in terms of, for instance, the Hamiltonian function 7i, 
and the standard example is clearly the canonical (Boltzmann) distribution tt^ oc exp[—(3Ti.{i)], with inverse 
temperature f3. Note that properties of the physical system under investigation enter into the transition 
matrix, and thus into the Markov process, only via such a choice of the stationary distribution. 

As an example we now construct the transition matrix for a system of two Ising spins, using some 
standard choices of the Markov chain parameters. Consider a system of two degrees of freedom cti and (T2, 
each of which can take on values from { — 1, +1}. The composite system has a product structure, i.e. a state 
(Ti X (72 of the composed system is an element of the product space C = {—1,-1-1} x { — 1,-|-1}. This is the 
so-called configurational space of the system which we choose as the state space of the Markov chain. The 
Hamiltonian function of the system, a mapping from the configurational space onto the reals, is given by 



(15) 



For notational simplicity, the four elements of C are numbered as follows: 1 = (+1) x (-1-1), 2 = 
(-1-1) x (—1), 3 = (—1) X (-1-1), 4 = (—1) X (—1). The element T21 of the transition matrix, for example, 
yields the probability to go from the state (-1-1) x (-1-1) to the state (-1-1) x (—1). 

As a proposal matrix, we choose a single spin flip algorithm, proposing to change any one of the two 
spins, each with probability 1/2: 



(16) 



As in the example of section [3J we choose the Boltzmann distribution tt,; oc e ''^(*) to be the stationary 
distribution of the Markov chain, 
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The standard Metropolis acceptance function f{z) 
form 
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(17) 
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min{l, z} then leads to an acceptance matrix of the 
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(18) 
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For the transition matrix of the Markov chain we therefore obtain 



T = 



/2(1 



V 



,-2/3 
3-2/3 





3-2/3 

.-2/3 



2(1 



-2/3 



(19) 



and it is easily verified that tt as given in Eq. (|17p is indeed its stationary distribution. 

When implementing the Monte Carlo method on a computer, one would, however, never bother to 
compute all of the entries of the transition matrix T. Fortunately, it is sufhcient to propose a new state and 
compute only the acceptance rate of this particular proposal. 



5. Tuning parameters 

In the previous sections, we have provided a basis for addressing a question which, for all practical 
applications of the Monte Carlo method, is of utmost importance and, at the same time, very difficult to 
answer in general: Given a state space and a certain quantity of interest; for which choices of the remaining 
parameters (initial distribution, proposal matrix, stationary distribution, and acceptance function) can we 
obtain a good Monte Carlo estimator of the quantity of interest within a certain amount of time? An 
additional complication arises from the fact that optimization is desired not with respect to Monte Carlo 
time (length N of the Markov chain), but with respect to the CPU time of the computer on which the 
Monte Carlo run is implemented. Therefore, whenever possible, computationally simple choices of proposal 
and acceptance probabilities are to be preferred. Although this issue will not be discussed in great detail, 
we will come back to it later on. 

It can be regarded as a general rule that, the more about the quantity of interest is already known, the 
better the optimization problem can be approached. In praxi, a rather limited number of strategies how to 
appropriately choose proposal and acceptance matrices is employed, the two most prominent ones being: 

Importance sampling: The stochasticity of the Markov chain gives rise to a statistical error in the Monte 
Carlo estimator ^a.t.at of the quantity of interest F. Choose the transition matrix T such that this 
statistical error is minimized. 

Avoiding effective non-ergodicity: Even if a transition matrix is ergodic, transitions between one block 
of states and another one may be extremely rare on the time scale accessible by computer simulation. 
This situation leads to a systematic error in Monte Carlo estimators, which is to be avoided. 

In short, importance sampling is about reducing the statistical error, avoiding block-like transition matrices 
refers to avoiding systematic errors. In order to obtain a useful Monte Carlo estimator, both these issues 
have to be taken into account simultaneously (at least up to a satisfactory level) , which is often a very hard 
task. In the following, for each of these two concepts, the above developed notations allow us to present the 
important and prevalent Monte Carlo strategies applied in statistical physics (but not only there) in a terse 
and precise way. 

A third concept to improve results from Monte Carlo is by sophisticatedly choosing the quantity F for 
which a Monte Carlo est imator is computed. Examples of suc h quantities include Monte Carlo renormal- 
ization (ISwendsenl . [l979h and transition matrix Monte Carlo (jWang and Swendsenl l2002h . These choices 



depend strongly on the physical problem under investigation, and a discussion of this topic is beyond the 
scope of this article. 

5.1. Importance sampling 

The aim of importance sampling is to choose the transition matrix of a Markov chain such that the 
statistical error present in a Monte Carlo estimator is mi nimal. An answer to the question of which transition 



matrix to choose is provided by the following theorem (jRubinsteinl . Il98l[ ) 
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Theorem 15 (Importance sampling). For a set of random variables {Xn}n=o distributed according to 
TT — {iTi : i G S), the variance of 



is minimal if the proportionality 



holdsE 



TT, cxI/WI^O (21) 



Proof: By making use of Jensen's inequality; see Appendix [Xl 

Frequently, from a single Markov chain, several (or even a continuum of) estimators for different quantities 
of interest are to be computed. Examples include the simultaneous computation of different canonical 
averages like energy and magnetization, or the computation of a certain canonical average for various values 
of the temperature. In this case, what is an optimal choice for one single quantity is a suboptimal choice 
for another, and a compromise has to be found. The ideal compromise can be found by slightly generalizing 
the above theorem, yielding 



TTi OC 



M 



\ m=l 



as the optimal stationary distribution for the simultaneous computation of the M Monte Carlo estimators 

= l m = l,...,M. (23) 

^ f V ^ N~l * 

For the computation of a Monte Carlo estimator of, say, a canonical expectation value of the energy as 
defined in the ideal choice, following Theorem [TKl of the stationary distribution of the Markov chain is 

n, = \H{i)\cxp[-l3H{i)]. (24) 

At first sight, it may come as a surprise to the reader that, nonetheless, such an error minimizing distribution 
is virtually never used in applications. Among the reasons for this are the following: 

• Even among expert computational physicists. Theorem 1151 is not very well-known. 

• If, for a certain choice of the stationary distribution, the computation of the transition probabilities 
Tij is costly in time, it may pay off to choose a 'similar' but computationally simpler stationary 
distribution. This allows to obtain a longer Markov chain within the same amount of CPU time and 
may thus reduce the statistical error. An example is, again, the Boltzmann distribution, where the 
transition probabilities depend only on the energy differences, not on their absolute values. Especially 
for systems with short-range interactions this fact can be exploited to gain computational advantages. 

• There are situations where, in spite of Theorem[T5l the quality of the Monte Carlo estimator obtained 
with a different stationary distribution may be superior: The choice of the stationary distribution 
also has an effect on whether or not the transition matrix has a block-like structure (as discussed in 
the next section), possibly leading to systematic errors in the Monte Carlo estimators. It may pay 
off to choose a stationary distribution which is not optimal in the sense of importance sampling, but 
which avoids a block-like transition matrix. An example is the umbrella sampling (or multicanonical or 
entropic sampling) method for systems whose infinite cou nterparts show a first order phase transition 
(see Glossary and, for example. Berg and Neuhaui ( 199l[ )). 



^The condition 7^ is no serious restriction on the choice of functions /, as the states i for which f{i) = could be 

simply excluded from the state space. Then, however, ergodicity on this reduced state space needs to be fulfilled. 
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5.2. Avoiding effective non-ergodicity 

An often encountered problem which prevents convergence of a Monte Carlo estimator on the time scale 
accessible is the presence of an almost block-like structure in powers T" of the transition matrix for large n. 
"Almost block-like" refers to a situation where, for a suitable choice of the basis, all elements of T" in the 
exterior of some blocks on the diagonal are exceedingly small. This situation can be sketched as T" being 
of the form 

/I 



■ 


>0 


^0 


■ 



for all sufficiently large n. Although, in a strict sense, T might be ergodic, this leads to "effective non- 
ergodicity," meaning that for viable values of the Markov chain length N, certain parts of the state space 
will not be reached from a given initial distribution with very high probability. Hence, the actual realization 
of the Markov chain differs significantly from its expected distribution in the limit N ^ oo, leading to a bad 
Monte Carlo estimator in general. For a transition matrix constructed according to the recipe from Sec. |H 
the emergence of a block-like structure will originate from the interplay of the chosen proposal matrices P 
and acceptance matrices A. Since usually, when constructing acceptance matrices in statistical physics, the 
Hamiltonian or related phase space functions enter, occurrence of a block-like transition matrix will depend 
strongly on the physical system under investigation. The presence of a first-order phase transition is an 
example where such problems are encountered frequently. Due to the dependence on the properties of the 
physical system considered, it is difficult to devise a general solution. Instead, we will briefly outline the 
two principle strategies of addressing this problem by means of an example. 

Consider the Ising model on a two-dimensional square lattice, characterized by the Hamiltonian 

= -J^a.a,, (25) 

where J > is a coupling constant and the angular brackets denote a summation over all pairs of neighboring 
sites on the lattice. Each of the ai (i = 1, . . . , N) is a classical spin variable, taking on the values —1 or +1. 
The states a = (cti, . . . , (Jn) are therefore elements from the product space S = { — 1, 

This model is known to be in a ferromagnetic phase for large values of the inverse temperature /3. 
Vaguely speaking, this means that there is a high probability for the system to be in states with nonzero 
magnetization m = AI/N with 

N 



M(a)=^a„ (26) 



i=l 



but a low probability for observing a magnetization close to zero. Such a probability distribution is given 

by 

V{m) ^J\f^S[M{a)-Nm]exp[-(3H{<T)], (27) 

creS 

where 7\A is a normalization constant. A sketch of V{m) might look approximately like the graph shown in 
Fig.ffl 

A Monte Carlo estimator of V{m) is then obtained by computing 

where we decided not to bother about the normalization constant. According to the scheme from Sec. HI 
we have the parameters P2-P4 to choose when constructing the transition matrix of the underlying Markov 
chain. A standard choice of these parameters is the one also used in the worked example in Sec.[31 
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P2: Single spin flip proposal matrix: pick one of the cr^ with equal probability 1/iV and propose to flip it 

to — (Ti. 



P3: Stationary distribution tTo- oc exp[— (ct)] (Boltzmann distribution). 
P4: Metropolis acceptance function f{z) = min{l,2;}. 

Knowledge of the approximate shape of V{m) as sketched in Fig. [T] indicates already that we might run 
into trouble with these choices: Imagine a given initial state ctq of the Monte Carlo chain such that its 
magnetization M[ao) is located in the left one of the two maxima of the probability distribution V{in) in 
Fig. [H The single spin flip proposal P2 will then propose a new state cri with a magnetization M{ai) = 
M(<7o)±2/7V. For the relatively large numbers of particles N of interest in statistical physics, this corresponds 
to a small change of magnetization. Therefore, in order to also reach the (numerous and important) states 
corresponding to the right maximum of 7-", the process has to proceed step by step through the minimum 
at magnetization zero. This, however, is exceedingly unlikely to happen, and with high probability the 
process will remain trapped in the one maximum of V where it started, inducing a bias in the Monte Carlo 
estimator. 

How can this problem possibly be resolved? Reflecting the way the the transition matrix has been 
constructed, the strategies employed concentrate either on a modification of the proposal matrix, or on the 
choice of the stationary distribution (which in turn effects the acceptance matrix). 

Modifying the stationary distribution. Knowing that the low probability of finding states with magnetiza- 
tions around zero is causing the problems, one strategy is to choose a stationary distribution of the Markov 
chain which favors precisely these states. In this way, proposals to move towards zero on the magnetiza- 
tion axis are accepted more frequently, while those moving away from zero are more often rejected. Some 
examples of this strategy (together with the appropriate references) can be found in the glossary, including 
entropic sampling, multicanonical ensemble, and umbrella sampling. In order to employ this strategy, a 
knowledge of the probability distributions of the relevant observables is of avail, and such knowledge may 
also be acquired by means of preparatory Monte Carlo simulations. 

Modifying the proposal matrix. Instead of favoring the acceptance of states in the vicinity of magnetization 
zero as in the previous paragraph, an other strategy consists in modifying the proposal matrix such that also 
states which correspond to a large step on the magnetization axis are suggested. This can be achieved by 
proposing to flip not only a single spin, but for example a cluster of many adjacent spins. Several variants of 
such cluster algorithm s have been reported in the literature, in cluding the S wendsen-Wang cluster algorithm 
(jSwendsen and Wand . [l987l ) and the Wolff cluster algorithm (jWolfj . Il989l) . 



Almost all strategies employed in Markov chain Monte Carlo fall into one of these two classes, and 
although the details may differ, the key ideas are mostly very similar in nature. 



6. Summary 

In this article, the foundations of Markov-chain Monte Carlo have been presented with the aim of 
identifying the simulation parameters which still may be adjusted. This formal approach is suitable for 
exposing the flexibility of the Monte Carlo method, allowing us to draw on plentiful possibilities when 
designing new and efficient Monte Carlo algorithms. 

We have restricted the presentation in this article to Markov-chain Monte Carlo methods with transition 
matrices that do not depend on time, but discussing a broader class of Monte Carlo methods from the point 
of view we have advocated is straightforward. Let us mention two important examples: 

Deterministic proposal. — In the ?ith Monte Carlo step, flipping of the {n mod N)-th out of N spins 
is proposed with probability one. This is a time-dependant proposal, resulting in a time-depending 
transition matrix. The computational advantage of this scheme relies on the fact that per Monte Carlo 
step one random number less needs to be generated, resulting in a faster algorithm. Convergence issues, 
however, may be more delicate. 
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Wang-Landau algorithm. — This algorithm uses the outcomes of the first n elements of a stochastic 
process for determining the acceptance rates of the (n + l)-th Monte Carlo step, resulting in a non- 
Markovian stoc hastic proces s. This method can significantly speed-up convergence in certain instances 
( Wang and Landa u. 2001; Landau et all . I2004 . 



Having studied the material presented in this article, the reader should be able to identify the key ideas of 
most Monte Carlo algorithms, both Markov chain and beyond. Moreover, it should have become clear how 
flexible a tool Monte Carlo can be. Of course, the present article covers only certain aspects of Monte Carlo. 
In order to write a good Monte Carlo simulation, training in the more applied aspects will be essential as 
well. Hopefully the knowledge of both, the formal and the practical aspects will enable and encourage the 
reader to give free rein to his creativity as what regards Monte Carlo. 
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A. Proofs 

Proof of Theorem llSl 1. T is a stochastic matrix: 

(a) lij > for all i,j £ 5*: follows immediately from the conditions (^and C^of Theorem [T51 

(b) The transition probabilities sum up to unity: 

ies ies ies kes ies keS 

where condition (^has been used in the last line. 
2. TT is the stationary distribution of T; 

(T7r)i = ^ Tij-Kj = ^ Pijf^ijT^] + ^ Sij ^ Pfcj(l - Afej)7rj = ^ PyAyTTj + ^ Pfc,;(l - Afci)7ri. (30) 

jes jes jes kes jes kes 



(29) 



Since P is symmetric we obtain 



(T7r)i = ^ Pij(Ay7rj - AjjTTj) + tTj ^ Pjj, (31) 



which yields tt^ as a consequence of detailed balance. □ 
Proof of Theorem \15l The variance of F is defined as 



2 / , j.^ \ 2 



Since {Xn}n^i is distributed according to tt, the second term is independent of tt. We therefore only need 
to minimize the first term, for which Jensen's inequality provides the lower bound 
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Inserting tt^ = |/(j)|/ \fij)\ into the left hand side of this inequaUty, the lower bound is attained, 



1 ffi^r' 



^ E f 4 E Ei/(.)i H^H- ^^^^ 



Glossary 



In this glossary, a few of the most widely used methods and strategies in statistical physics Monte Carlo 
are formulated in the language introduced in the main body of the article. In this way, the structure of a 
particular algorithm and the main idea behind it are exposed particularly clearly, which makes for example 
the -^Creutz demon appear a bit less demonic than in its original formulation. The list is far from being 
complete, but it should be useful for bringing to life the general ideas outlined in this article. 

To be explicit, whenever any of the concepts in this glossary need to be described in terms of a state 
space function, the Hamiltonian function _ff : 5 ^ M is chosen, and the value of the Hamiltonian function for 
a certain state is called energy E. A generalization to other functions or more than one function is possible 
and straightforward. 

Boltzmann distribution. — Distribution tTj cx exp[—/3H{i)] where /? is a parameter (inverse temperature). 
Canonical distribution. — -^Boltzmann distribution. 

Cluster algorithm. — Proposal matrix which, for a spin system, in order to improve the performance of a 
simulation (see Sec. 15. 2p . proposes to change the value not of a single spin variable (-^Single spin flip 
algorithm) , but of a cluster of spin variables at a time. A cluster is a connected collection of spin vari- 
ables of the same value. Examples inclu de the Swendsen-Wang cluster algorithm ( Swendsen and Wangl . 
Il987t l and the Wolff cluster algorithm (jWolfj . Il989l ). 



Creutz demon. — Step shaped stationary distribution tt^ oc Q{E — H{i)), where E \s & parameter. The 
same name is also used for the distribution tt^ oc 9(— i?i + H{i))<d{E2 — H{i)) where Ei < E2- 

Dynamical ensemble. — Power law shaped stationa ry distribution tt^ cx [{E — H{i))/ B]^^ '^)/'^ ^ where E 
and B are parameters ( Hiiller and Gerlin^ . 1993 ). 



Entropic sampling. — — » Umbrella sampling. 

Gaussian ensemble. — Stationary distribution tt^ oc exp {— a[i/(z) — | , where a and E ar e parameters. 



In the limit a oo, a Microcanonical distribution is approached (lHetheringtonl . ll987l ). 



Glauber dynamics. — Acceptance function of the form f : z ^ z/{l + z) (Glauber, 19631 ). 



Histogram. — For a Markov chain {Xn}n=ij ^ histogram is the number of states which are subject to a 
certain condition. Often energy histograms are employed, i.e., the number of states 

hiE):= ^EMi^) (Gl) 

from a Markov chain with a certain energy E. For the density of states 

n{E)=J2SE.Hi^) (G2) 

of a system, an estimator O = h{E) / {Ntt{E)) can be obtained from a histogram, where 7r(i5) is the 
(energy depend ent) stationary distrib ution of the Markov chain. First applied in statistical physics 
Monte Carlo bv lSalsburg et all (|l959[ ). 
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Histogram reweighting. — Essentially a calculation of canonical quantities from their definition: an es- 
timator of a canonical average is computed from a Monte Carlo estimator of the density of states 
(which in turn is ob tained from a ^Histogram). First applied in connection with Monte Carlo by 
Salsburg et all (|l959f l 



Kawasaki dynamics. — Proposal matrix which, for a spin system, restricts the Markov cha in onto a 
subspace of the state space such that the magnetization is kept fixed ( Kawasaki , 19661 ). This is 
achieved by proposing to interchange the values of the spins at two randomly chosen sites (or at two 
randomly chosen neighboring sites). 



Met ropol is acceptance. 
1953). 



Popular choice of the acceptance function, / 



min{l, z} (jMetropohs et al 



Microcanonical distribution. 
Multicanonical ensemble. — 



- Distribution tt^ oh S[E 
■Umbrella sampling. 



i/(z)]. 



Multihistogram method. — Combining the information contained in -^Histograms from two or more 
distinct Markov chains, typically for different transition matrices, in a way such that statistical errors 
in the resulting data are minimized. For two histograms, hi{E) and h2{E), assuming the statistical 
errors Ahi{E) = y/hi{E) to be those of a Poisson distribution, the statistical error in the density of 
states n(E) is minimized for 

m) . (G3) 



where 



E 



/ii7r| + /i27r^ 
/ii7ri/i27r2 



(G4) 



Put forward by Bennett! ( 1976f) . extended bv iFerrenberg and Swendsen ( 19891 ). 
Parallel tempering. — -^Replica exchange Monte Carlo. 

Replica exchange Monte Carlo. — As in the related method of -^Simulated tempering, improvement of 
convergence is achieved by enlarging the state space of the Markov chain: for a quantity of interest 
which consists of a sum over a state space S, 



(G5) 



an estimator F is obtained by making use of a Markov chain not on S, but on a superspace of S, 



\s\ 



1 1^ fjij) 



(G6) 



ci) e {Xn}n=i- In contrast to simulated 



where the first sum is over all indices («i,r7ii, 

tempering, the chain {Xn}n=i is Markov(A, T, TV) on an even larger space, the product space {S x S*)'"^' 
(where \ S\ denotes the number of elements in 5), restricted such that, for each element (ii, mi, 
from this state space, the rrij form a permutation of the elements of S. The stationary distribution on 
the product space is again a product, (TTi^m)!"^!. The method can be viewed as creating simultaneously 
stochastic chains on l^l copies of the same system but with different stationary distributions, while 
exchanging randomly the stationary distributions of the copies. The exchange of the distributions 
mirrors the above "permutation" -restriction of the state space. The expectation values from these 
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copies can be summed up to improve the Monte Carlo estimator [second sum in Eq. (|G6p ]. The ad- 
vantage with respect to simulated tempering is that, due to the exchange of distributions, the different 
m-values appear eq ually distribu t ed an d no laborious tuning of additional constants is necessary. Put 
forward by Swend sen and Wangl ( 19861 ). 



Simulated tempering — This method aims at improving convergence by using a generalization of defini- 
tion [7] of a Monte Carlo estimator given in Sec. [31 for a quantity of interest which consists of a sum 
over a state space 5, 

les 

an estimator F is obtained by making use of a Markov chain not on but on a superspace of S, 

where {Xn}n=i is Markov(A, T, TV) on the product space S x S. A careful choice of S and the station- 
ary distribution TTi.rn can l ead to a significant improvement of convergence. In the original paper by 
Marinari and Parisil ( 1992[ ). ni^m is chosen Boltzmann distributed exp[— /3„iiJ(i) 4- gm] with different 



inverse temperatures /3m and constants gm, labeled by the indices m G S*. Then, switching between 
states with different m-values can be regarded as considering a system at randomly switching temper- 
atures. Careful tuning of the ^m-values is necessary in order to obtain non-negligible probabilities for 
having states of the various m-values in the Markov chain. 

Single spin flip algorithm. — Proposal matrix which, for a spin system, proposes to change the value of 
only one of the spin variables at a time. This guarantees proposal of a state with an energy similar 
to that of the current state, and therefore a high acceptance rate (see Sec. 15. 2p . A more general 
formulation is the following: on a state space S with product structure, S — x ... x — S^^, and 
states i = {ii, ...^im) G S where im G 5] for all m = 1, ...,M, a single spin flip algorithm employs a 
proposal matrix P = (P^ : i,j G S) with elements 

lO else, 
where |E| is the number of elements in S. 

Umbrella sampling. — Stationary distribution tt^ oc l/n[H{i)] where O is an estimator on the density 
of states il{E) := \S\^^ J^ies ^E.H{i}- This choice of the stationary distribution gives rise to an 
approximately flat energy -^Histogram, 

M^)-^ E '5^.«««^E%f!f -const. (GIO) 



Put forward bv lTorrie and Valleau ( 1977), but variations of this id ea reappeared in the literature under 



the names of multicanonical ensemble ( Berg and Neuhaui . 1991 ) and entropic sampling (,Lee. . ,1993h . 
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Figure 1: Sketch of the probability distribution V of the magnetization m for an Ising system in its ferromagnetic phase. 
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